arXiv:1505.07930v1 [cs.CV] 29 May 2015


Salient Object Detection via Augmented Hypotheses 


Tam V. Nguyen and Jose Sepulveda 

Department for Technology, Innovation and Enterprise 
Singapore Polytechnic 

{nguyen_van_tam, sepulveda_jose}@sp.edu.sg


Abstract 

In this paper, we propose using augmented hypotheses, which consider
objectness, foreground, and compactness, for salient object detection.
Our algorithm consists of four basic steps. First, our method generates
the objectness map via objectness hypotheses. Second, based on the
objectness map, we estimate the foreground margin and compute the
corresponding foreground map, which prefers the foreground objects.
Third, from the objectness map and the foreground map, the compactness
map is formed to favor compact objects. Finally, we derive a saliency
measure that produces a pixel-accurate saliency map which uniformly
covers the objects of interest and consistently separates foreground and
background. We evaluate the proposed framework on two challenging
datasets, MSRA-1000 and iCoSeg. Our extensive experimental results show
that our method outperforms state-of-the-art approaches.



Figure 1: From top to bottom: original images, the objectness
hypotheses, results of our saliency computation, and ground
truth labeling. For better viewing, only 40 object hypotheses
are displayed in each image.


1 Introduction 

The ultimate goal of salient object detection is to find the salient
objects that draw human attention in an image. Research has shown that
computational models simulating low-level, stimuli-driven attention
[Koch and Ullman, 1985; Itti et al., 1998] are quite successful and
represent useful tools in many practical scenarios, including image
resizing [Achanta et al., 2009], attention retargeting [Nguyen et al.,
2013a], dynamic captioning [Nguyen et al., 2013b], image classification
[Chen et al., 2012], and action recognition [Nguyen et al., 2015]. The
existing methods can be classified into biologically-inspired and
computationally-oriented approaches. On the one hand, works belonging to
the first class [Itti et al., 1998; Cheng et al., 2011] are generally
based on the model proposed by Koch and Ullman [Koch and Ullman, 1985],
in which the low-level stage processes features such as color,
orientation of edges, or direction of movement. One example of this
model is the work by Itti et al. [Itti et al., 1998], which uses a
Difference of Gaussians approach to evaluate those features. However,
the resulting saliency maps are generally blurry and often overemphasize
small, purely local features, which renders this approach less useful
for applications such as segmentation and detection [Cheng et al.,
2013].

On the other hand, computational methods relate to typical applications
in computer vision and graphics. For example, frequency space methods
[Hou and Zhang, 2007] determine saliency based on the spectral residual
of the Fourier transform of an image. The resulting saliency maps
exhibit undesirable blurriness and tend to highlight object boundaries
rather than the entire object area. Since human vision is sensitive to
color, different approaches use local or global analysis of color
contrast. Local methods estimate the saliency of a particular image
region based on immediate image neighborhoods, e.g., based on
dissimilarities at the pixel level [Ma and Zhang, 2003] or histogram
analysis [Cheng et al., 2011]. While such approaches are able to produce
less blurry saliency maps, they are agnostic of global relations and
structures, and they may also be more sensitive to high-frequency
content such as image edges and noise. In a global manner, [Achanta et
al., 2009] achieves globally consistent results by computing color
dissimilarities to the mean image color. Murray et al. [Murray et al.,
2011] introduced an efficient model of color appearance, which contains
a principled selection of parameters as well as an innate spatial
pooling mechanism. There also exist different patch-based methods which
estimate dissimilarity between image patches [Goferman et al., 2010;
Perazzi et al., 2012]. While these algorithms are more consistent in
terms of global image structures, they suffer from the involved
combinatorial complexity; hence they are applicable only to relatively
low-resolution images, or they need to operate in spaces of reduced
image dimensionality [Bruce and Tsotsos, 2005], resulting in a loss of
salient details.

Figure 2: Saliency maps computed by our proposed AH method (t) and
state-of-the-art methods (a-r): salient region detection (AC [Achanta et
al., 2008]), attention based on information maximization (AIM [Bruce and
Tsotsos, 2005]), context-aware (CA [Goferman et al., 2010]),
frequency-tuned (FT [Achanta et al., 2009]), graph based saliency (GB
[Harel et al., 2006]), global components (GC [Cheng et al., 2013]),
global uniqueness (GU [Cheng et al., 2013]), global contrast saliency
(HC and RC [Cheng et al., 2011]), spatial temporal cues (LC [Zhai and
Shah, 2006]), visual attention measurement (IT [Itti et al., 1998]),
maximum symmetric surround (MSS [Achanta and Süsstrunk, 2010]), fuzzy
growing (MZ [Ma and Zhang, 2003]), saliency filters (SF [Perazzi et al.,
2012]), induction model (SIM [Murray et al., 2011]), spectral residual
(SR [Hou and Zhang, 2007]), saliency using natural statistics (SUN
[Zhang et al., 2008]), and the objectness map (s). Our result (t)
focuses on the main salient object as shown in the ground truth (u).


Despite many recent improvements, the difficult question remains whether
“the salient object is a real object”. That question links the problem
of salient object detection to traditional object detection research. In
object detection, keeping the computational cost of sliding window
detection feasible is very important. Therefore, there exist numerous
objectness hypothesis generation methods proposing a small number (e.g.,
1,000) of category-independent hypotheses that are expected to cover all
objects in an image [Lampert et al., 2008; Alexe et al., 2012; Uijlings
et al., 2013; Cheng et al., 2014]. An objectness hypothesis is usually
represented as a value which reflects how likely an image window covers
an object of any category. Lampert et al. [Lampert et al., 2008]
introduced a branch-and-bound scheme for detection. However, it can only
be used to speed up classifiers for which users can provide a good bound
on the highest score. Alexe et al. [Alexe et al., 2012] proposed a cue
integration approach to get better prediction performance more
efficiently. Uijlings et al. [Uijlings et al., 2013] proposed a
selective search approach to get higher prediction performance. However,
these methods are time-consuming, taking 3 seconds for one image.
Recently, Cheng et al. [Cheng et al., 2014] presented a simple and fast
objectness measure using binarized normed gradients features, with which
computing the objectness of each image window at any scale and aspect
ratio requires only a few bit operations. This method can run 1,000+
times faster than popular alternatives.

In this work, we investigate applying objectness to the problem of
salient object detection. We utilize the object hypotheses from
objectness hypothesis generation, augmented with foreground and
compactness constraints, in order to produce a fast and high-quality
salient object detector. Exemplary object hypotheses and our saliency
prediction are shown in the second and third rows of Figure 1,
respectively. As we demonstrate in our experimental evaluation, each of
our individual measures already performs close to or even better than
some existing approaches, and our combined method currently achieves the
best ranking results on two public datasets provided by [Achanta et al.,
2009; Batra et al., 2010]. Figure 2 shows the comparison of our saliency
map to other baselines in the literature. The main contributions of this
work can be summarized as follows.

• We conduct a comprehensive study of how objectness hypotheses affect
salient object detection.

• We propose the foreground map and the compactness map, derived from
the objectness map, which cover both global and local information of
the salient object.

• Unlike other works in the literature, we evaluate our proposed method
on two challenging datasets in order to assess the impact of our work
in different settings.

2 Methodology 

In this section, we describe the details of our augmented hypotheses
(AH), and we show how the objectness measures as well as the saliency
assignment can be efficiently computed. Figure 3 illustrates the
overview of our processing steps.

Figure 3: Illustration of the main phases of our algorithm: (a) original
image, (b) hypotheses, (c) objectness, (d) margin, (e) foreground, (f)
compactness, (g) saliency map. The object hypotheses are generated from
the input image. The objectness map is then formed by accumulating all
hypotheses. The foreground map is created from the difference between
each pixel’s color and the background color obtained following the
estimated margins. We then oversegment the image into superpixels and
compute the compactness map based on the spatial distribution of
superpixels. Finally, a saliency value is assigned to each pixel.

2.1 Objectness Map

In this work, we extract object hypotheses from the input image to form
the objectness map. We assume that the salient objects attract more
object hypotheses than other parts of the image. As mentioned above,
objectness hypothesis generators propose a small number $n_p$ (e.g.,
1,000) of category-independent hypotheses that are expected to cover all
objects in an image. Each hypothesis $P_i$ has coordinates
$(l_i, t_i, r_i, b_i)$, where $(l_i, t_i)$ is the top-left point and
$(r_i, b_i)$ is the bottom-right point. Here, we formulate each
hypothesis as $P_i \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are
the height and the width of the input image $I$, respectively. The value
of each element $P_i(x, y)$ is defined as:

$$P_i(x, y) = \begin{cases} 1 & \text{if } t_i \le x \le b_i \text{ and } l_i \le y \le r_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$


The objectness map is constructed by accumulating all object
hypotheses:

$$OB(x, y) = \sum_{i=1}^{n_p} P_i(x, y). \qquad (2)$$
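As a minimal sketch of Eqs. (1) and (2), the accumulation can be written in NumPy as follows. The function name `objectness_map`, the inclusive treatment of both corner points, and the rescaling by the maximum value are assumptions made for illustration; the hypotheses themselves would come from an external generator that is not reproduced here.

```python
import numpy as np


def objectness_map(boxes, height, width):
    """Accumulate box hypotheses (l, t, r, b) into a map scaled to [0, 1]."""
    ob = np.zeros((height, width), dtype=np.float64)
    for l, t, r, b in boxes:
        # Eq. (1): each hypothesis is an indicator image of its window.
        # Rows are indexed by t..b and columns by l..r (corners inclusive,
        # which is an assumption of this sketch).
        ob[t:b + 1, l:r + 1] += 1.0  # Eq. (2): sum over all hypotheses
    peak = ob.max()
    # Assumed rescaling: divide by the maximum so the peak equals 1.
    return ob / peak if peak > 0 else ob
```

Pixels covered by many overlapping windows receive values close to 1, which matches the assumption that salient objects attract more hypotheses.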

The objectness map is then rescaled into the range [0..1]. We observe
that the objectness map discourages object parts located close to the
image boundary. Thus, we extend the original image by adding an image
border whose size is 10% of the original image’s size. The additional
image border is filled with the mean color of the original image. We
then perform the hypothesis extraction and compute the objectness map on
the extended image in the same way as described above. The final
objectness map is cropped to the size of the original image. Figure 4
demonstrates the effect of our image extension and the shrinkage of the
objectness map.
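The border extension and the final crop described above can be sketched as follows. The helper names are hypothetical, and reading "10% of the original image's size" as 10% of the height and width on each side is an assumption of this sketch.

```python
import numpy as np


def extend_with_mean_border(image, ratio=0.1):
    """Pad an H x W x C image with a border filled with its mean color."""
    h, w, c = image.shape
    # Assumption: the border is ratio * size on every side.
    pad_h, pad_w = int(round(ratio * h)), int(round(ratio * w))
    mean_color = image.reshape(-1, c).mean(axis=0)
    extended = np.empty((h + 2 * pad_h, w + 2 * pad_w, c), dtype=np.float64)
    extended[:] = mean_color                              # fill the border
    extended[pad_h:pad_h + h, pad_w:pad_w + w] = image    # paste the original
    return extended, (pad_h, pad_w)


def crop_to_original(extended_map, pads, shape):
    """Cut a map computed on the extended image back to the original size."""
    pad_h, pad_w = pads
    h, w = shape
    return extended_map[pad_h:pad_h + h, pad_w:pad_w + w]
```

The objectness map would be computed on `extended` and then passed through `crop_to_original`, so windows that overlap the image boundary are no longer penalized.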



Figure 4: From left to right: the original image, the object
hypotheses and the corresponding objectness map, the extended
object hypotheses and the corresponding objectness map.


2.2 Foreground Map 

The salient object tends to be distinctive from its surrounding context.
Thus, we aim to model the background, which can facilitate object
localization. In particular, the foreground map is computed from the
difference between the colors of the original image and those of the
background. In order to model the background, we first localize the
salient object by the margin shown as the red rectangle in Figure 3d. To
this end, we compute the accumulated objectness level along four
directions, namely top, bottom, left, and right. For each direction, the
accumulated objectness level is bounded by a threshold $\theta$. To
speed up this process, we utilize the integral image [Viola and Jones,
2001] computed from the objectness map. As a result, there are $n_r$
($n_r = 4$ in this work) rectangles surrounding the salient object, one
per direction. Each bounding rectangle $r_i$ is represented by its mean
color $\mu_{r_i}$. The foreground value of each pixel $(x, y)$ is
computed as follows:

$$FG(x, y) = \prod_{i=1}^{n_r} \lVert I(x, y) - \mu_{r_i} \rVert, \qquad (3)$$

where $I(x, y)$ is the color vector of the pixel $(x, y)$.
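One possible reading of the margin estimation and of Eq. (3) is sketched below; it is not the authors' implementation. The assumptions are: each of the four outer strips may hold at most a fraction `theta` of the total objectness, the strips span the full image width or height, the per-rectangle distances are combined by a product, and every strip is kept at least one pixel thick. The input `lab` is taken to be an H x W x 3 array already converted to the desired color space.

```python
import numpy as np


def estimate_margin(ob, theta=0.1):
    """Return (top, bottom, left, right) bounds of the object margin."""
    h, w = ob.shape
    # Integral image with a leading zero row/column: ii[r, c] = ob[:r, :c].sum()
    ii = np.zeros((h + 1, w + 1))
    ii[1:, 1:] = ob.cumsum(axis=0).cumsum(axis=1)
    total = ii[-1, -1]
    rows, cols = ii[:, -1], ii[-1, :]   # mass of the first r rows / c columns
    budget = theta * total
    # Largest strip from each side whose accumulated mass stays within budget.
    top = int(np.searchsorted(rows, budget, side="right")) - 1
    left = int(np.searchsorted(cols, budget, side="right")) - 1
    bottom = int(np.searchsorted(rows, total - budget, side="left"))
    right = int(np.searchsorted(cols, total - budget, side="left"))
    # Guard (assumption): never let a background strip become empty.
    top, left = max(top, 1), max(left, 1)
    bottom, right = min(bottom, h - 1), min(right, w - 1)
    return top, bottom, left, right


def foreground_map(lab, ob, theta=0.1):
    """Eq. (3), read as a product of distances to the four strip means."""
    top, bottom, left, right = estimate_margin(ob, theta)
    strips = [lab[:top], lab[bottom:], lab[:, :left], lab[:, right:]]
    fg = np.ones(lab.shape[:2])
    for strip in strips:
        mu = strip.reshape(-1, lab.shape[2]).mean(axis=0)  # mean strip color
        fg *= np.linalg.norm(lab - mu, axis=2)
    return fg
```

With a product, a pixel whose color matches the mean of any one background strip receives a foreground value of zero, so only colors that differ from all four strips remain.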


2.3 Compactness Map 


The foreground map prefers the colors of the salient foreground object.
Unfortunately, it also favors similar colors appearing in the
background. We observe that colors belonging to the background tend to
be distributed over the entire image, exhibiting a high spatial
variance, whereas foreground objects are generally more compact [Perazzi
et al., 2012]. Therefore, we compute the compactness map in order to
remove the noise from the background. First, we compute the centroid of
interest

$$(x_c, y_c) = \left( \frac{\sum_{(x,y)} x \times OF(x, y)}{\sum_{(x,y)} OF(x, y)}, \; \frac{\sum_{(x,y)} y \times OF(x, y)}{\sum_{(x,y)} OF(x, y)} \right),$$

where the objectness-foreground value is
$OF(x, y) = OB(x, y) \times FG(x, y)$. Intuitively, pixels close to the
centroid of interest tend to be more salient, whereas pixels farther
away tend to be less salient. In addition, the saliency value of a pixel
is reduced if the path between the centroid and that pixel contains many
low saliency values. The naive method is to compute the path from the
centroid of interest to every other pixel. However, it is time-consuming
to perform this task at the pixel level. Therefore, we transform it to
the superpixel level. The image is over-segmented into superpixels, and
the OF value of a superpixel is computed as the average OF value of all
the pixels it contains. The over-segmented image can be formulated as a
graph $G = (V, E)$, where $V$ is the list of vertices (superpixels) and
$E$ is the list of edges connecting neighboring superpixels.

Algorithm 1 Superpixel compactness computation

$l = \{v_c\}$
$c = 0 \in \mathbb{R}^{n_s}$
$t = \emptyset$
while $l \neq \emptyset$ do
    for each vertex $v_i$ in $l$ do
        for each edge $(v_i, v_j)$ do
            if $c(v_j) < \sqrt{c(v_i) \times OF(v_j)}$ then
                $c(v_j) \leftarrow \sqrt{c(v_i) \times OF(v_j)}$
                $t \leftarrow t \cup \{v_j\}$
            end if
        end for
    end for
    $l \leftarrow t$
    $t = \emptyset$
end while
return compactness values $c$ of superpixels

The procedure to compute the compactness values of superpixels is
summarized in Algorithm 1. Denote $v_c$ as the superpixel containing the
centroid of interest. The algorithm transfers the OF value from $v_c$ to
all other superpixels. The procedure performs a sequence of relaxation
steps, namely assigning to the compactness value $c(v_j)$ of superpixel
$v_j$ the square root of the product of its neighboring superpixel’s
compactness value and its own OF value. Our algorithm only relaxes edges
from vertices $v_j$ for which $c(v_j)$ has recently changed, since other
vertices cannot lead to correct relaxations. Additionally, the algorithm
terminates early when no recent changes exist. Finally, the compactness
value CN is computed as:

$$CN(x, y) = c(sp(x, y)), \qquad (4)$$

where $sp(x, y)$ returns the index of the superpixel containing
pixel $(x, y)$.
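The centroid of interest and the relaxation procedure of Algorithm 1 can be sketched in Python as follows. The function names and the edge-list graph format are hypothetical. One further assumption is labeled in the code: the start superpixel is seeded with its own OF value, since with an all-zero initialization no relaxation could ever raise a value. OF values are assumed to lie in [0, 1].

```python
import numpy as np


def centroid_of_interest(of):
    """OF-weighted centroid (x_c, y_c); x indexes rows and y indexes columns."""
    xs, ys = np.indices(of.shape)
    total = of.sum()
    return (xs * of).sum() / total, (ys * of).sum() / total


def propagate_compactness(of_sp, edges, start):
    """Relaxation over the superpixel graph.

    of_sp : per-superpixel mean OF values
    edges : list of undirected (i, j) pairs between neighboring superpixels
    start : index of the superpixel v_c that contains the centroid
    """
    n = len(of_sp)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    c = np.zeros(n)
    c[start] = of_sp[start]   # assumption: seed v_c with its own OF value
    frontier = {start}        # the list l of recently changed vertices
    while frontier:
        changed = set()       # the list t
        for i in frontier:
            for j in neighbors[i]:
                candidate = np.sqrt(c[i] * of_sp[j])
                if c[j] < candidate:
                    c[j] = candidate
                    changed.add(j)
        frontier = changed
    return c
```

Given a label image `labels` that stores the superpixel index of each pixel, Eq. (4) reduces to the lookup `c[labels]`. Values only ever increase and are bounded, so the loop stops once no vertex changes.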

2.4 Saliency Assignment 

We normalize the objectness map $OB$, the foreground map $FG$, and the
compactness map $CN$ to the range [0..1]. We assume that all measures
are independent, and hence we combine these terms as follows to compute
a saliency value $S$ for each pixel:

$$S(x, y) = OB(x, y) \times FG(x, y) \times CN(x, y). \qquad (5)$$

The resulting pixel-level saliency map may have an arbitrary scale. In
the final step, we rescale the saliency values to the range [0..1] such
that the map contains at least 10% salient pixels.
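A minimal sketch of the fusion in Eq. (5) is given below. Min-max normalization is an assumed choice for mapping each map to [0..1], and the function names are hypothetical. The additional requirement of at least 10% salient pixels is not implemented here because the text does not spell out how it is enforced.

```python
import numpy as np


def rescale01(m):
    """Min-max rescaling to [0, 1]; a constant map becomes all zeros."""
    lo, hi = m.min(), m.max()
    if hi > lo:
        return (m - lo) / (hi - lo)
    return np.zeros_like(m, dtype=np.float64)


def fuse_saliency(ob, fg, cn):
    """Eq. (5): pixel-wise product of the three normalized maps."""
    s = rescale01(ob) * rescale01(fg) * rescale01(cn)
    return rescale01(s)   # final rescaling of the saliency values
```

Because the maps are multiplied, a pixel is salient only if all three cues agree; a zero in any one map suppresses it.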


2.5 Implementation Settings

We apply the state-of-the-art objectness detection technique, i.e.,
binarized normed gradients (BING) [Cheng et al., 2014], to produce a set
of candidate object windows. Our reason for selecting BING is two-fold.
First, the BING extractor is weakly trained on a simple feature, e.g.,
binarized normed gradients. Therefore, it is useful compared to a
bottom-up edge extractor. Second, the BING extractor is able to run 10
times faster than real-time, i.e., at 300 frames per second (fps). The
BING hypothesis generator is trained on the VOC2007 dataset [Everingham
et al., 2010], the same as in [Cheng et al., 2014]. In order to compute
the foreground map, $\theta$ is set to 0.1 and we convert the color
channels from RGB to the Lab color space as suggested in [Achanta et
al., 2009; Perazzi et al., 2012]. Regarding the image over-segmentation,
we use SLIC [Achanta et al., 2012] for the superpixel segmentation. We
set the number of superpixels to 100 as a trade-off between fine
over-segmentation and processing time.


3 Evaluation

3.1 Datasets and Evaluation Metrics

We evaluate and compare the performance of our algorithm against
previous baseline algorithms on two representative benchmark datasets:
the MSRA-1000 salient object dataset [Achanta et al., 2009] and the
Interactive Cosegmentation Dataset (iCoSeg) [Batra et al., 2010]. The
MSRA-1000 dataset contains 1,000 images with the pixel-wise ground truth
provided by [Achanta et al., 2009]. Note that each image in this dataset
contains a salient object. Meanwhile, iCoSeg contains 643 images with
single or multiple objects in a single image.

The first evaluation compares the precision and recall rates. High
recall can be achieved at the expense of reducing the precision and vice
versa, so it is important to evaluate both measures together. In the
first setting, we compare binary masks for every threshold in the range
[0..255]. In the second setting, we use the image-dependent adaptive
threshold proposed by [Achanta et al., 2009], defined as twice the mean
saliency of the image:

$$T_a = \frac{2}{W \times H} \sum_{(x,y)} S(x, y). \qquad (6)$$

In addition to precision and recall, we compute their weighted harmonic
mean, or F-measure, which is defined as:

$$F_\beta = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}. \qquad (7)$$

As in previous methods [Achanta et al., 2009; Cheng et al., 2013;
Perazzi et al., 2012], we use $\beta^2 = 0.3$.

For the second evaluation, we follow Perazzi et al. [Perazzi et al.,
2012] to evaluate the mean absolute error (MAE) between a continuous
saliency map $S$ and the binary ground truth $G$ for all image pixels
$(x, y)$, defined as:

$$MAE = \frac{1}{W \times H} \sum_{(x,y)} |S(x, y) - G(x, y)|. \qquad (8)$$
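The evaluation measures of Section 3.1 (precision and recall at a threshold, the adaptive threshold of twice the mean saliency, the F-measure with $\beta^2 = 0.3$, and the MAE) can be sketched in NumPy as follows. The function names are hypothetical, and treating pixels with saliency greater than or equal to the threshold as foreground is an assumption of this sketch.

```python
import numpy as np


def precision_recall(saliency, gt, threshold):
    """Precision and recall of the mask obtained by thresholding saliency."""
    pred = saliency >= threshold      # assumption: ">=" marks foreground
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return float(precision), float(recall)


def adaptive_threshold(saliency):
    """Image-dependent threshold: twice the mean saliency of the image."""
    return 2.0 * saliency.mean()


def f_measure(precision, recall, beta2=0.3):
    """Weighted harmonic mean of precision and recall."""
    denom = beta2 * precision + recall
    return (1.0 + beta2) * precision * recall / denom if denom else 0.0


def mean_absolute_error(saliency, gt):
    """Mean absolute difference between saliency map and binary ground truth."""
    return float(np.abs(saliency - gt.astype(np.float64)).mean())
```

For the fixed-threshold setting, `precision_recall` would be called once per threshold value and the results averaged over all images to obtain a curve. Unlike precision and recall, the MAE also rewards correctly suppressed background pixels.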


3.2 Performance on MSRA-1000 dataset

Following [Achanta et al., 2009; Perazzi et al., 2012; Cheng et al.,
2013], we first evaluate our methods using a precision/recall curve,
which is shown in Figure 5. Our work reaches the highest
precision/recall rate over all baselines. As a result, our method also
obtains the best performance in terms of F-measure. We also evaluate the
individual components of our system, namely the objectness map (OB), the
foreground map (FG), and the compactness map (CN). They generally
achieve acceptable performance that is comparable to other baselines.
The objectness map by itself is outperformed by our proposed augmented
hypotheses. The novelty of this work is that we adopt and augment the
conventional hypotheses by adding two key features, foregroundness and
compactness, to detect salient objects. When these are fused together,
our unified system achieves state-of-the-art performance in every
evaluation metric.

(a) Fixed threshold  (b) Adaptive threshold  (c) Mean absolute error

Figure 5: Statistical comparison with 18 saliency detection methods
using all the 1000 images from the MSRA-1000 dataset [Achanta et al.,
2009] with pixel-accurate saliency region annotation: (a) the average
precision-recall curve obtained by segmenting saliency maps using fixed
thresholds, (b) the average precision and recall obtained by adaptive
thresholding (using the same method as in FT [Achanta et al., 2009], SF
[Perazzi et al., 2012], GC [Cheng et al., 2013], etc.), (c) the mean
absolute error of the different saliency methods with respect to the
ground truth mask. Please see Figure 2 for the references to the
publications in which the baseline methods are presented.

SF  SIM  SR  SUN  Ours  Ground truth

Figure 6: Visual comparison of saliency maps on the iCoSeg dataset. We
compare our method (AH) to 10 alternative methods. Our results are close
to the ground truth and focus on the main salient objects.


As discussed in SF [Perazzi et al., 2012] and GC [Cheng et al., 2013],
neither the precision nor the recall measure considers the true negative
counts. These measures favor methods which successfully assign saliency
to salient pixels but fail to detect non-salient regions over methods
that successfully do the opposite. Instead, they suggested that MAE is a
better metric than precision-recall analysis for this problem. As shown
in Figure 5c, our work outperforms the state-of-the-art performance
[Cheng et al., 2013] by 24%. One may argue that a simple boosting of
saliency values, similar to that applied to the results in [Perazzi et
al., 2012], would improve it. However, boosting saliency values could
easily also boost the low saliency values related to the background,
which we aim to avoid.

(a) Fixed threshold  (b) Adaptive threshold  (c) Mean absolute error

Figure 7: Statistical comparison with 10 saliency detection methods
using all the 643 images from the iCoSeg benchmark [Batra et al., 2010]
with pixel-accurate saliency region annotation: (a) the average
precision-recall curve obtained by segmenting saliency maps using fixed
thresholds, (b) the average precision and recall obtained by adaptive
thresholding (using the same method as in FT [Achanta et al., 2009], GC
[Cheng et al., 2013], etc.), (c) the mean absolute error of the
different saliency methods with respect to the ground truth mask.


3.3 Performance on iCoSeg dataset 

The iCoSeg dataset is “less popular” in the sense that some baselines do
not release detection results or source code for it. We could only
reproduce 10 methods on iCoSeg thanks to their available source code.
The visual comparison of saliency maps generated by our method and
different baselines is shown in Figure 6. Our results are close to the
ground truth and focus on the main salient objects. We first evaluate
our methods using precision/recall curves, which are shown in Figure 7a,
b. Our method outperforms all other baselines in both settings, namely
fixed threshold and adaptive threshold. As shown in Figure 7c, our
method achieves the best performance in terms of MAE. Our work
outperforms other methods by a large margin of 25%.


3.4 Computational Efficiency 

It is also worth investigating the computational efficiency of different
methods. In Table 1, we compare the average running time of our approach
to the currently best performing methods on the benchmark images. We
compare the speed of our method with the methods with the most
competitive accuracy (GC [Cheng et al., 2013], SF [Perazzi et al.,
2012]). The average time of each method is measured on a PC with an
Intel i7 3.3 GHz CPU and 8GB RAM. All the methods compared in this table
are based on implementations in C++ and MATLAB. The CA method is the
slowest one because it requires an exhaustive nearest-neighbor search
among patches. Meanwhile, our method is able to run in real time. Our
procedure spends most of the computation time on generating the
objectness map (about 35%) and forming the compactness map (about 50%).
From the experimental results, we find that our algorithm is effective
and computationally efficient.

Table 1: Comparison of running times on the MSRA-1000 benchmark
[Achanta et al., 2009].

Method    CA      RC    SF    GC    Ours
Time (s)  51.2    0.14  0.15  0.09  0.07
Code      Matlab  C++   C++   C++   C++

4 Conclusion and Future Work 

In this paper, we have presented a novel method, augmented hypotheses
(AH), which adopts object hypotheses in order to rapidly detect salient
objects. To this end, three maps are derived from object hypotheses:
superimposed hypotheses form an objectness map, a foreground map is
computed from deviations in color from the background, and a compactness
map emerges from propagating saliency labels in the oversegmented image.
These three maps are fused together to detect salient objects with sharp
boundaries. Experimental results on two challenging datasets show that
our results are 24%-25% better than the previous best results (compared
against 10+ methods on two different datasets) in terms of mean absolute
error, while also being faster.

For future work, we aim to investigate more sophisticated techniques for
objectness measures and to integrate more cues, i.e., depth [Fang et
al., 2012] and audio [Chen et al., 2014] information. Also, we would
like to study the impact of salient object detection on the object
hypothesis process.